Abstract

We derive tight bounds on the cache misses for evaluation of explicit stencil operators
on structured grids. Our lower bound is based on the isoperimetric property of the
discrete octahedron. Our upper bound is based on a good surface-to-volume ratio of a
parallelepiped spanned by a reduced basis of the interference lattice of a grid.
Measurements show that our algorithm typically reduces the number of cache misses by a
factor of three, relative to compiler-optimized code. We show that stencil calculations
on grids whose interference lattice has a short vector feature abnormally high numbers
of cache misses. We call such grids unfavorable and suggest avoiding them in
computations by appropriate padding. By direct measurements on a MIPS R10000 processor
we show a good correlation between abnormally high numbers of cache misses and
unfavorable three-dimensional grids.

1 Introduction

On modern computers the gap between access times to cache and to global memory amounts
to several orders of magnitude, and is growing. As a result, improvement in usage of the
memory hierarchy has become a significant source of enhancing application performance.
Well-organized data traffic may improve performance of a program, without changing the
actual amount of computation, by reducing the time the processor stalls waiting for data.
Both data location and access patterns affect the amount of data movement in the program,
and the effectiveness of the cache.

A number of techniques for improvements in usage of data caches have been developed
in recent years. The techniques include improvements in data reuse (i.e. temporal locality)
[?, 5, 13], improvements in data locality (i.e. spatial locality) [13], and reductions in
conflicts in data accesses [?, 10]. In practice, these techniques are implemented through
code and data transformations such as array padding and loop unrolling, tiling, and fusing.
Tight lower and upper bounds on memory hierarchy access complexity for FFT and matrix
multiplication algorithms are given in [?]. However, questions concerning bounds on the
number of cache misses and how closely current optimization techniques approach those
bounds for stencil operators remain open.

In this paper we consider improvement of cache usage through maximizing temporal
locality in evaluations of explicit stencil operators on structured discretization grids. Our
contribution is twofold. First, we prove lower and upper bounds on the number of cache
misses for local operators on structured grids. Our lower bound (i.e. the number of
unavoidable misses) is based on the discrete isoperimetric theorem. Our upper bound (i.e. the
achievable number of misses) is based on a cache fitting algorithm which utilizes a special
basis of the grid interference lattice. As shown by example, the lower bound can be achieved
in some cases. The second contribution is the identification of grids with unfavorable
dimensions, which cause significant increases in cache misses. We provide two
characterizations of these unfavorable grids. The first one, derived experimentally, states
that the product of all relevant grid dimensions is close to a multiple of half the cache
size. The second characterization is that the grid interference lattice has a short vector.

2 Cache model and definitions 

We consider a single-level, virtual-address-mapped, set-associative data cache memory, see
[?]. The memory is organized in $a$ sets of $z$ lines of $w$ words each. Hence, it can be
characterized by the parameter triplet $(a, z, w)$, and its size $S$ equals $a \cdot z \cdot w$
words. A cache with parameters $(a, 1, w)$ is called fully associative, and with parameters
$(1, z, w)$ it is called direct-mapped.

The cache memory is used as a temporary fast storage of words used for processing. A
word at virtual address $A$ is fetched into a $(a(A), z(A), w(A))$ cache location, where
$w(A) = A \bmod w$, $z(A) = \lfloor A/w \rfloor \bmod z$, and $a(A)$ is determined according
to a replacement policy (usually a variation of least recently used). The replacement policy
is not important within the scope of this paper.

If a word is fetched, then $w - 1$ neighboring words are fetched as well to fill the cache
line completely. In practice, $a$, $z$, and $w$ are often powers of 2 in order to simplify
computation of the location in cache. For example, the MIPS R10000 processor for which we
report some measurements in Section 6 has a cache with parameters (2, 512, 4), which makes
$S$ equal to 4K double precision words, or 32KB.
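The address mapping of this cache model can be illustrated with a minimal Python sketch. The names `CacheParams` and `cache_location` are hypothetical helpers introduced here for illustration only, addresses are counted in words, and the way index $a(A)$ is left out because it depends on the replacement policy.

```python
from typing import NamedTuple


class CacheParams(NamedTuple):
    """Parameter triplet (a, z, w) of the cache model (illustrative helper)."""
    a: int  # associativity
    z: int  # number of lines per way
    w: int  # words per cache line

    @property
    def size(self) -> int:
        # Cache size S = a * z * w, in words.
        return self.a * self.z * self.w


def cache_location(address: int, cache: CacheParams) -> tuple[int, int]:
    """Return (z(A), w(A)) for word address A; the way a(A) is policy dependent."""
    line_index = (address // cache.w) % cache.z
    word_offset = address % cache.w
    return line_index, word_offset


# Parameters quoted in the text for the MIPS R10000 data cache.
R10000 = CacheParams(a=2, z=512, w=4)

print(R10000.size)                                  # cache size S in words
print(cache_location(13, R10000))                   # line index and word offset
print(cache_location(13 + R10000.z * R10000.w, R10000))
```

Under these assumptions two word addresses that differ by $z \cdot w = S/a$ fall into the same line index and word offset, so for a two-way cache conflicts appear at strides of half the cache size.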

Our lower bound for the minimum number of cache misses that must be suffered during 
a stencil computation holds for any cache, including fully associative caches. The upper 
bound shows that a particular number of cache misses can be achieved by choosing a special 
sequence of computations. A cache miss is defined as a request for a word of data that is 
not present in the cache at the time of the request. A cache load is defined as an explicit 
request for a word of data for which no explicit request has been made previously (a cold 
load), or whose residence in the cache has expired because of a cache load of another word 
of data into the exact same location in the cache (a replacement load). The definitions of 
cold and replacement loads match those of cold and replacement cache misses, respectively 
[?], and if $w$ equals 1 they completely coincide.

If a piece of code features $\phi$ cache misses and $\mu$ cache loads, it can easily be shown
that $\mu \le w\phi$. For a code with good spatial locality we typically have
$\mu \approx w\phi$. As can be shown by a simple example, no bound of the form
$\phi \le c\mu$ ($c$ constant) can be derived for arbitrary code segments, but if the code
implements a non-redundant stencil operation, we have $\phi \le |K|\mu$, where $|K|$ is the
total number of points within the stencil. This is shown as follows. Let the stencil operation
be written as $q(x) = Ku(x)$, with $x \in \Omega$. Here $\Omega$ is the (not necessarily
contiguous) point set on which array $q$ is evaluated. Let $\bar{\Omega}$ be the $K$-extension
of $\Omega$, which is the point set on which $u$ must be defined in order to compute $q$ at all
points of $\Omega$. The total distinct number of elements of $u$ used is $|\bar{\Omega}|$, so
$\mu \ge |\bar{\Omega}| \ge |\Omega|$. The number of cache misses does not exceed the total
number of accesses to array $u$ (which may include repeated accesses to the same element),
which equals $|K||\Omega|$, so $\phi \le |K||\Omega| \le |K|\mu$. Consequently, we have the
following interval inequality: $|K|^{-1} \le \mu/\phi \le w$, which can be used to bound the
number of cache misses in terms of the number of cache loads.



3 A lower bound for cache loads for local operators 

In this section we consider the following problem: for a given $d$-dimensional structured grid
and a local stencil operator $K$, how many cache loads have to be incurred in order to compute
$q = Ku$, where $q$ and $u$ are two arrays defined on the grid. We will provide a lower bound
$\mu$ which asserts that, regardless of the order in which the grid points are visited for the
computation of $q$, at least $\mu$ cache loads have to occur. In the next section we provide a
cache fitting algorithm for the computation of $q$ whose number of cache loads closely
approaches the lower bound.

We use the following terminology to describe the operator $K$. The vectors $k_1, \ldots, k_s$,
defined such that $q(x)$, the value of $q$ at the grid point identified by the vector $x$, is a
function of the values $u(x + k_1), \ldots, u(x + k_s)$, are called stencil vectors. Locality
of $K$ means that the stencil vectors are contained in a cube
$\{x \mid |x_i| \le r,\ i = 1, \ldots, d\}$ ($r$ is called the radius of $K$, and $2r + 1$ its
diameter). In this section we assume that $K$ contains only the star stencil (i.e. the
$\{0, e_1, \ldots, e_d, -e_1, \ldots, -e_d\}$ stencil). A lower bound for cache loads for the
star stencil will give us a lower bound for any stencil containing it.
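As a concrete illustration of this terminology, the following Python sketch enumerates the stencil vectors of a star stencil. The function name `star_stencil` and the extension to a radius $r > 1$ (axis-aligned offsets up to distance $r$) are assumptions made for this example; the text above only defines the radius-one case.

```python
def star_stencil(d: int, r: int = 1) -> list[tuple[int, ...]]:
    """Stencil vectors of a d-dimensional star stencil of radius r.

    For r = 1 this is the set {0, e_1, ..., e_d, -e_1, ..., -e_d}.
    """
    vectors = [tuple([0] * d)]          # the center point
    for axis in range(d):
        for step in range(1, r + 1):
            for sign in (1, -1):
                v = [0] * d
                v[axis] = sign * step   # offset along a single coordinate axis
                vectors.append(tuple(v))
    return vectors


def neighbors(x: tuple[int, ...], stencil: list[tuple[int, ...]]) -> list[tuple[int, ...]]:
    """Points x + k_j whose values of u are needed to compute q(x)."""
    return [tuple(xi + ki for xi, ki in zip(x, k)) for k in stencil]


print(len(star_stencil(3, 1)))   # 2d + 1 points for radius 1
print(len(star_stencil(3, 2)))   # 2dr + 1 points for radius 2
print(neighbors((5, 5), star_stencil(2)))
```

With $d = 3$ and $r = 2$ the sketch yields 13 vectors, which is the size of the 13-point star stencil used for the measurements in Section 6.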

Let $q$ be computed in the $K$-interior $R$ of a rectangular region (a grid) $G$. We assume
that computation of $q$ is performed in a pointwise fashion, that is, at any grid point the
value of $q$ is computed completely before computation of the value of $q$ at another point is
started. In order to compute the value of $q$ at a grid point $x$, the values of $u$ in
neighbor points of $x$ must be loaded into the cache (a point $y$ is a neighbor of $x$ if
$y - x$ is a stencil vector of $K$). If $x$ is a neighbor of $y$ and $u(y)$ has been loaded in
cache to compute $q(z)$ but is dropped from the cache before $q(x)$ is computed, then $u(y)$
must be reloaded, resulting in a replacement load associated with $x$.

To estimate the number of elements, $\rho$, of array $u$ that must be replaced, we choose a
partition of $R$ into a disjoint union of grid regions $R_i$, with
$R = \bigcup_{i=1}^{k} R_i$, in such a way that $q$ is computed in all points of $R_i$ before it
is computed at any point of $R_{i+1}$, see Figure 1. Let $B_{ij}$ be the set of points in $R_j$
which are neighbors of $R_i$. Since the star stencil is symmetrical, the points of $B_{ij}$ are
neighbor points of $B_{ji}$. Because any point of $B_{ij}$ can have at most $2d$ neighbors
in $B_{ji}$, we have the following inequalities:

$$\frac{1}{2d}\,|B_{ij}| \le |B_{ji}| \le 2d\,|B_{ij}|. \qquad (1)$$

Figure 1: The boundaries $B_{ij}$ of already computed values of $q$ in a sequence of regions
$R_i$. Reloading of some values of $u$ on the boundary of $R_3$ results in at least
$\max(|B_{31}| + |B_{32}| - S,\ 0)$ cache loads.

For computation of $q$ in $R_i$ we have to replace at least $\rho_i$ values of $u$, where
$\rho_i$ equals $\max\left(\sum_{j=1}^{i-1} |B_{ij}| - S,\ 0\right)$. The total number of
replaced values in the course of computing $q$ on the entire grid will be at least $\rho$,
where $\rho$ equals $B - kS$, and $B$ equals $\sum_{i=1}^{k}\sum_{j=1}^{i-1} |B_{ij}|$.
Summing all the terms $|B_{ij}|$, taking into account Equation (1) and the fact that
$|B_{ii}| = 0$, we get:



$$B \ge \frac{1}{4d}\sum_{i=1}^{k}\sum_{j=1}^{k} |B_{ij}|. \qquad (2)$$



Let $\delta R_i$ be the exterior boundary of $R_i$, that is, all points neighbor to $R_i$ not
belonging to $R_i$, and let $\delta_i$ be the subset of the grid boundary $D$ having neighbors
only in $R_i$. Here, $D$ is defined as $G \setminus R$. Obviously,
$\delta_i \cap \delta_j = \emptyset$ if $i \ne j$, and
$\sum_{i=1}^{k} |\delta_i| \le |\delta R| \le |D|$.

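Now we choose the $R_i$ such that $|\delta R_i| = \sigma \ge 8dS$, where $\sigma$ is specified
below, and let $\nu$ equal $\max_i |R_i|$. Consequently, we have $k \ge |R|/\nu = V/\nu$, and

$$\rho \ge \frac{1}{4d}\sum_{i=1}^{k}\left(|\delta R_i| - |\delta_i|\right) - kS
  = k\left(\frac{\sigma}{4d} - S\right) - \frac{1}{4d}\sum_{i=1}^{k} |\delta_i|. \qquad (3)$$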



We subsequently choose $\sigma$ in such a way that

$$\sigma = |\delta O(d,t)| = \sum_{k=1}^{d} 2^k \binom{d}{k}\binom{t}{k-1} \ge 8dS \qquad (4)$$

for some $t$, where $O(d,t)$ is the standard $d$-dimensional octahedron of radius $t$ (see
Appendix A). It follows from Equation (21), Appendix A, that $t$ can be chosen in such a way
that $\sigma$ is less than $8d(2d+1)S$. Now the value of $\nu$ can be estimated using the
isoperimetric property of the octahedron (see again Appendix A), namely:
$\nu \le |O(d,t)|$. Hence, we find

Z > ' OT '"- f " > C , S -* (5) 

v - 8d(2d+l)i/ _ 8d{2d+l)\0{d,t)\ ~ KJ 

where $c_d$ equals $1/(d(2d+1)2^{d+2})$. This gives the following lower bound:

We also have $V + |D| = |G|$ and $|D| \le 2d|G|/l$, where $l$ is the smallest size of the grid.
This gives the final lower bound $\mu$ for the total number of elements of $u$ to be read into
the cache:

»>V + p>v(l + c d S-^-±)>\G\(l-^ + (l-^)c d S-^) . (7) 



21 J ~ V I I 

In general, assuming that the cache associativity $a$ is larger than the diameter of the
operator $K$, the order of this lower bound cannot be improved, as the following example
shows (remember that our lower bound is valid for a cache with any associativity, including
a fully associative cache). Let the spatial extents of a two-dimensional grid be $n_1$ and
$n_2$, respectively, with $n_1$ equal to $kS$ and $n_2$ arbitrary, and perform calculations of
the star stencil (i.e. $r = 1$) in the following order:

do i = 0, k*a-1
  do j = 2, n2-1
    do i1 = max(2, 1+i*(S/a)), min(n1-1, (i+1)*(S/a))
      q(i1,j) = u(i1,j) + ...
    end do
  end do
end do

Since $n_1$ equals $kS$, all values of $q$ and $u$ having the same value of the first index are
mapped into the same cache location within a set. Since $a$ exceeds $2r + 1$, none of the values
required for the computation of $q$ will be replaced in the cache, except those at a distance
$r$ around the line defined by i1 = i*S/a. The total number of elements of $u$ read into the
cache for execution of this loop nest will therefore be
$n_1 n_2 + (n_2 - 2)2r(ka - 1) - 4$, which equals
$n_1 n_2 (1 - 2/n_1 + 2a(1 - 2/n_2)/S)$. Similar examples in higher dimensions show that
the order of our lower bound (Equation (7)) cannot be improved.
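The equality of the two expressions for the number of elements read in this example can be checked with exact rational arithmetic. The sketch below is only an arithmetic check for $r = 1$ and $n_1 = kS$; the function names and the sample parameter values are assumptions chosen for illustration.

```python
from fractions import Fraction


def loads_in_example(n2: int, k: int, S: int, a: int, r: int = 1) -> int:
    """Count n1*n2 + (n2 - 2)*2r*(k*a - 1) - 4 with n1 = k*S."""
    n1 = k * S
    return n1 * n2 + (n2 - 2) * 2 * r * (k * a - 1) - 4


def closed_form(n2: int, k: int, S: int, a: int) -> Fraction:
    """Closed form n1*n2*(1 - 2/n1 + 2a(1 - 2/n2)/S) for r = 1, n1 = k*S."""
    n1 = k * S
    return n1 * n2 * (1 - Fraction(2, n1) + Fraction(2 * a, S) * (1 - Fraction(2, n2)))


# Sample values (assumed): S = 4096 words, associativity a = 4 > 2r + 1.
S, a = 4096, 4
for k in (1, 2, 3):
    for n2 in (10, 91, 100):
        assert loads_in_example(n2, k, S, a) == closed_form(n2, k, S, a)

total = loads_in_example(91, 1, S, a)
print(total, float(Fraction(total, S * 91)) - 1.0)  # loads and relative excess over |G|
```

For these sample values the relative excess over $n_1 n_2$ is of the order of $2a/S$, in line with the order of the lower bound for $d = 2$.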

4 An upper bound for cache loads for local operators. 
Cache fitting algorithm 

In order to obtain an upper bound we present a cache fitting algorithm which has a small
number of replacements. We find a set $P$ of cache conflict-free indices of $u$ and calculate
$Ku$ at the points of $P$. Then we tile the index space of $u$ with $P$ to minimize the total
number of replacements. For the analysis we assume a cache associativity of one, which is the
worst case for replacement loads.

Let $L$ be a set in the index space of $u$ having the same image in cache as the index
$(0, \ldots, 0)$, see Figure 2. $L$ is a lattice in the sense that there is a generating set
$\{b_i\}$, $i = 1, \ldots, d$, such that $L$ is the set of grid points
$\{(0, \ldots, 0) + \sum_{i=1}^{d} x_i b_i \mid x_i \in \mathbb{Z}\}$. We call this the
interference lattice of $u$. It can be defined as the set of all vectors
$(i_1, \ldots, i_d)$ such that

$$i_1 + n_1 i_2 + \cdots + n_1 n_2 \cdots n_{d-1} i_d \equiv 0 \bmod S. \qquad (8)$$

In [?] this lattice is defined as the set of solutions to the cache miss equation.

Let $P$ be a fundamental parallelepiped of $L$*. For future reference we note that
$\mathrm{vol}(P) = \det L = S$. The second equality follows from the fact that $L$ has a basis
$\{v_i\}$ of the form:

$$v_1 = S e_1, \qquad v_i = -m_i e_1 + e_i, \quad 2 \le i \le d, \qquad
  m_{i+1} = \prod_{j=1}^{i} n_j. \qquad (9)$$

Obviously, the vectors $v_i$ satisfy Equation (8). Conversely, any vector satisfying Equation
(8) can be represented as a linear combination of $v_1, \ldots, v_d$, with coefficients
$x_k = i_k$ for $k = 2, \ldots, d$, and $x_1 = (i_1 + m_2 i_2 + \cdots + m_d i_d)S^{-1}$.
$x_1$ is an integer, since $i_1 + m_2 i_2 + \cdots + m_d i_d$ is divisible by $S$ according to
Equation (8). Since $v_1, \ldots, v_d$ are linearly independent vectors, they form a basis of
the lattice.
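The construction of this basis and the decomposition of a lattice vector can be traced in a short Python sketch. The helper names (`address`, `in_lattice`, `triangular_basis`, `coefficients`) are hypothetical, and the grid extents and cache size in the usage lines are sample values taken from the measurements discussed in Section 6.

```python
from math import prod


def address(index, dims):
    """Word offset i_1 + n_1 i_2 + n_1 n_2 i_3 + ... of a grid index."""
    stride, offset = 1, 0
    for i, n in zip(index, dims):
        offset += stride * i
        stride *= n
    return offset


def in_lattice(index, dims, S):
    """True if the index satisfies Equation (8), i.e. maps to the same cache image as 0."""
    return address(index, dims) % S == 0


def triangular_basis(dims, S):
    """Basis of Equation (9): v_1 = S e_1, v_i = -m_i e_1 + e_i for i >= 2."""
    d = len(dims)
    basis = [tuple([S] + [0] * (d - 1))]
    m = 1
    for i in range(1, d):
        m *= dims[i - 1]            # m_{i+1} = n_1 * ... * n_i
        v = [0] * d
        v[0], v[i] = -m, 1
        basis.append(tuple(v))
    return basis


def coefficients(index, dims, S):
    """Integer coefficients (x_1, ..., x_d) of a lattice vector in the basis above."""
    x1, remainder = divmod(address(index, dims), S)
    if remainder:
        raise ValueError("index is not in the interference lattice")
    return (x1,) + tuple(index[1:])


dims, S = (45, 91, 100), 4096       # sample grid and cache size (assumed values)
basis = triangular_basis(dims, S)
print(basis)
print(all(in_lattice(v, dims, S) for v in basis))
# The basis matrix is lower triangular, so its determinant is the diagonal product.
print(prod(basis[i][i] for i in range(len(dims))))
print(coefficients((1, 0, 1), dims, S))
```

The determinant printed by the sketch equals $S$, which is the statement $\det L = S$ used above.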

Let $F$ be a face of $P$ (see Figure 2), and let $v$ be a basis vector of $L$ such that
$P = \{f + xv \mid f \in F,\ 0 \le x < 1\}$. Then shifts $F + (k/g)v$,
$k = \ldots, -1, 0, 1, \ldots$ contain all integer points of a pencil $Q$, with
$Q = \{f + xv \mid f \in F,\ x \text{ is any number}\}$ for an appropriate value of $g$†. The
values of $q$ at the points of $Q$ can be computed without replacing reusable values of $u$
except at a distance of $r$ or less from the boundary of $Q$. Let $h_1, \ldots, h_s$ be the
signed projections along $F$ of the stencil vectors of $K$ onto $v$, and let $h_+$ and $h_-$ be
the maximum and the minimum of the projections, respectively. We assume also that
$|h_+ - h_-|/g < |v|a$, meaning that the extent of $P$ in the direction of $v$ is big enough
to allow computing $q$ on $F$ without replacements. It may be impossible to satisfy this
condition when the shortest vector in $L$ is shorter than the diameter of $K$ divided by the
cache associativity. Lattices with short vectors are discussed in Section 6. The associated
grids are called unfavorable grids.

The Cache Fitting Algorithm for computing $q$ is as follows (see Figure 2); here $K(R)$ is
the set of points where $u$ must be available in order to compute $q$ in all points of $R$
(i.e. the $K$-extension of $R$):

set w = (1/g) v
do Q = Qmin, Qmax
  determine face F inside pencil Q
  do k = kmin, kmax
    load in cache all values of u inside K(F + k * w)
    compute q at F + k * w
  end do
end do

*A fundamental parallelepiped of a lattice $L$ is a set of points
$\{\sum_{i=1}^{d} x_i b_i \mid 0 \le x_i < 1\}$ for any basis $\{b_i\}$ of $L$.

†Let $I$ be a fundamental parallelepiped of the integer lattice in the subspace $Y$ generated
by $F$, and let $e$ be an integer vector such that $e$ and the basis of $I$ generate
$\mathbb{Z}^d$. Obviously, $g$ must be chosen in such a way that
$\mathrm{vol}((1/g)v, I) = \mathrm{vol}(e, I) = 1$. Hence,
$g = \mathrm{vol}(v, I) = (1/|F|)\mathrm{vol}(v, F) = \mathrm{vol}(P)/|F|$, where $|F|$ is
the index of the lattice $L \cap Y$ in the integer lattice of $Y$.

In this algorithm the parameters Qmin, Qmax, kmin and kmax are determined such that 
the scanning face F sweeps out the entire grid. Whenever a point is not contained in the 
grid, it is simply skipped in the nest. 

Since we defined the algorithm in such a way that the scanning face in the direction of v 
with step g passes through all integer points of Q, the values of q at all points inside Q will 
be computed. 
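The property that makes the scanning set conflict-free can be checked numerically in two dimensions: the integer points of a fundamental parallelepiped of the interference lattice occupy every cache location exactly once. The sketch below assumes a direct-mapped cache with one word per line, a sample row length $n_1 = 45$, and a hand-picked lattice basis; the function name is hypothetical.

```python
def points_in_parallelepiped(b1, b2):
    """Integer points of {x*b1 + y*b2 : 0 <= x, y < 1} for an integer 2-D basis."""
    det = b1[0] * b2[1] - b1[1] * b2[0]
    if det == 0:
        raise ValueError("basis vectors must be linearly independent")
    sign = 1 if det > 0 else -1
    corners = [(0, 0), b1, b2, (b1[0] + b2[0], b1[1] + b2[1])]
    xs = [c[0] for c in corners]
    ys = [c[1] for c in corners]
    points = []
    for px in range(min(xs), max(xs) + 1):
        for py in range(min(ys), max(ys) + 1):
            # Cramer's rule numerators, sign-adjusted so that the test
            # 0 <= coefficient < 1 becomes 0 <= numerator < |det|.
            x_num = sign * (px * b2[1] - py * b2[0])
            y_num = sign * (b1[0] * py - b1[1] * px)
            if 0 <= x_num < abs(det) and 0 <= y_num < abs(det):
                points.append((px, py))
    return points


n1, S = 45, 4096                  # sample row length and cache size (assumed)
b1, b2 = (-45, 1), (1, 91)        # both satisfy i_1 + n1*i_2 = 0 mod S
assert (b1[0] + n1 * b1[1]) % S == 0 and (b2[0] + n1 * b2[1]) % S == 0

tile = points_in_parallelepiped(b1, b2)
images = {(x + n1 * y) % S for x, y in tile}
print(len(tile), len(images))     # number of points and of distinct cache images
```

Both numbers printed are equal to $S$ for this basis, so under the stated assumptions the tile uses the whole cache without self interference.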



Figure 2: The Interference Lattice. Cache fitting set $F + kw$, $k \in \mathbb{Z}$, sweeps
across pencil $Q$ in the direction of $v$. Only values of $u$ at points at a distance $r$ or
less from the pencil boundaries $\beta_1$ and $\beta_2$ will be replaced in the cache when $K$
is evaluated inside of $Q$.

Replacement misses can occur only at points at a distance of $r$ or less from the boundaries
of the pencils. For each of these points at most $s$ replacements need to take place, where
$s$, the size of the stencil, is defined by $s = |K| \le (2r + 1)^d$. So the number of
replacements will not exceed $r(2r + 1)^d A$, where $A$ is the total surface area of all
pencils.

To minimize $A$ we choose $P$ so that $Q$ has a good surface to volume ratio. Let $P$ be the
fundamental parallelepiped of a reduced basis of $L$. A basis $b_1, \ldots, b_d$ of a
$d$-dimensional lattice $L$ is called reduced if

$$\prod_{i=1}^{d} \|b_i\| \le c_d \det L, \qquad (10)$$

where $c_d$ is a constant which depends only on $d$‡. Let $b_1$ be the shortest vector of the
basis, and let the eccentricity $e$ of the basis be defined by
$e = \max_i(\|b_i\|/\|b_1\|)$. If we define $\partial P$ as the surface of $P$, we can derive
an estimate for the surface-to-volume ratio of $P$:

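$$\frac{|\partial P|}{\mathrm{vol}(P)}
  \le \frac{2\sum_{i=1}^{d}\prod_{j \ne i}\|b_j\|}{\det L}
  \le 2c_d\sum_{i=1}^{d}\frac{1}{\|b_i\|}
  \le \frac{2dc_d}{\|b_1\|}
  \le e c'_d\left(\prod_{i=1}^{d}\|b_i\|\right)^{-1/d}
  \le e c'_d S^{-1/d}, \qquad (11)$$

where we twice used the Hadamard inequality: $\prod_{i=1}^{d}\|b_i\| \ge \det L$, and the
abovementioned fact that $\det L = S$. The constant $c'_d$ is defined by $c'_d = 2dc_d$.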
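For $d = 2$ a reduced basis can be obtained with the classical Lagrange-Gauss reduction, which is used in the sketch below in place of a general polynomial-time reduction algorithm. The function names are hypothetical, the reduction constant is taken as $c_2 = 2^{d(d-1)/4} = \sqrt{2}$, and the two row lengths are sample values chosen to contrast a well-conditioned lattice with one that has a very short vector.

```python
import math


def norm2(v):
    """Squared Euclidean length of a 2-D integer vector."""
    return v[0] * v[0] + v[1] * v[1]


def gauss_reduce(b1, b2):
    """Lagrange-Gauss reduction of a 2-D integer lattice basis (shortest vector first)."""
    while True:
        if norm2(b2) < norm2(b1):
            b1, b2 = b2, b1
        n = norm2(b1)
        dot = b1[0] * b2[0] + b1[1] * b2[1]
        m = (2 * dot + n) // (2 * n)        # nearest integer to dot / n
        if m == 0:
            return b1, b2
        b2 = (b2[0] - m * b1[0], b2[1] - m * b1[1])


def satisfies_reduction_bound(b1, b2):
    """Check ||b1|| ||b2|| <= sqrt(2) |det| using exact integer arithmetic."""
    det = b1[0] * b2[1] - b1[1] * b2[0]
    return norm2(b1) * norm2(b2) <= 2 * det * det


def eccentricity(b1, b2):
    """Ratio of the longest to the shortest basis vector."""
    lengths = sorted(math.sqrt(norm2(b)) for b in (b1, b2))
    return lengths[1] / lengths[0]


S = 4096                                    # sample cache size in words (assumed)
for n1 in (45, 2048):
    # Start from the basis v_1 = (S, 0), v_2 = (-n1, 1) of the 2-D interference lattice.
    r1, r2 = gauss_reduce((S, 0), (-n1, 1))
    print(n1, r1, r2, satisfies_reduction_bound(r1, r2), round(eccentricity(r1, r2), 2))
```

In this sketch the row length 45 gives two basis vectors of comparable length, while the row length 2048 (half the cache size) produces a shortest vector of length 2 and a very large eccentricity, which makes the surface-to-volume estimate weak.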

Since $A$ does not exceed the surface area of all fundamental parallelepipeds covering
the grid, the total number of these parallelepipeds (which equals $|G|/\det L$) gives us:
$A \le |\partial P||G|/\det L$, so that the total number of replacements $\rho$ can be bounded
by: $\rho \le r(2r + 1)^d |\partial P||G|/\det L$. This, combined with Equation (11), gives an
upper bound for the total number of elements to be loaded into the cache in the cache fitting
algorithm:

$$\mu \le |G| + \rho \le |G|\left(1 + e c''_d S^{-1/d}\right), \qquad (12)$$

where $c''_d$ is defined by $c''_d = r(2r + 1)^d c'_d$.

Note that if the shortest vector in the interference lattice has length $(S/f)^{1/d}$ for some
constant $f$, it follows that $e \le f c_d$. To show this, we sort the basis vectors in
Equation (10) in ascending order. Then it follows that
$\|b_1\|^{d-1}\|b_d\| \le c_d S$, and hence $e = \|b_d\|/\|b_1\| \le f c_d$.

In Appendix B we show that there are grids whose interference lattices feature $f$'s that
are independent of $S$ (provided that $S$ is a prime power, which is true in most practical
cases). For these lattices the relative gap between the upper bound (Equation (12)) and the
lower bound (Equation (7)) of the previous section goes to zero as $S$ increases. When the
cache associativity exceeds the diameter of $K$, this gap can be closed. In that case a
parallelepiped, built on a reduced basis of the interference lattice of the array indices with
$x_d = 0$, can be swept in the $d$th coordinate direction, similar to the example at the end of
Section 3. In general, the cache fitting algorithm gives full cache utilization, in contrast to
the algorithm for finding grid-aligned parallelepipeds devoid of interference lattice points,
as proposed in [?]. See Table 2, [?], where the sizes of blocks without self interference are
approximately 20% smaller than $S$.



5 Lower and upper bounds for multiple RHS arrays 

In this section we consider the case where there are multiple arrays involved in the
computation of $q$. Let $p$ be the number of arrays (we call these the RHS arrays), all having
the same sizes, and let the stencil of each RHS array include the star stencil. This means, in
particular, that for each boundary point of any region $R_i$ (see Figure 1) values of $p$ RHS
arrays are necessary§ for computation of $q$ in $R_i$. Hence, we have to replace at least
$\rho_i$ values of RHS arrays, with
$\rho_i = \max\left(p\sum_{j<i}|B_{ij}| - S,\ 0\right)$.

‡Every lattice has a reduced basis. There is a polynomial algorithm to find a reduced basis
with a constant $c_d = 2^{d(d-1)/4}$ [?, Ch. 6.2].

§As in Section 3, we assume that computation of $q$ is performed in a pointwise fashion. In
this case elements of all RHS arrays have to be loaded into cache simultaneously, reducing the
cache size effectively

Now we can repeat the arguments of Section 3 with $|V|$ and $|G|$ replaced by $p|V|$ and
$p|G|$, respectively, and $S$ replaced by $\lceil S/p \rceil$, to obtain the following lower
bound for the number of cache loads for stencil computations with $p$ RHS arrays:



P>p\V\+p > p\V\ \l + c d 




.., . , 2d -I ( . 2d\ 
> p \ G \ |i _ + h__) Q 



(13) 



In order to obtain an upper bound for cache loads for calculations with $p$ RHS arrays,
we assume that we are free to choose relative array offsets. Our upper bound is valid on
the assumption that the stencil diameter divided by the cache associativity is smaller than
the length of the longest lattice basis vector divided by $p$. Consider a stripwise tiling of
the fundamental parallelepiped $P$ for the lattice $L$, see Figure 3. Each tile $P_i$ has the
same size and shape. The size is determined by considering the longest edge vector $v$ in the
fundamental parallelepiped and dividing it into $p$ equal pieces of size
$\lfloor (S/|F|)/p \rfloor\,\|v\|$, so that each tile contains
$|F|\lfloor (S/|F|)/p \rfloor$ integer points, where $|F|$ is the number of integer points in
the face. The remainder part of the tiling is indicated by the shaded area. The reason why
the longest edge vector is selected for subdivision is as follows. Since we use a reduced
basis, the smallest angle between $v$ and $F$ is bounded from below, so the parallelepiped is
always close to orthogonal. Therefore, subdividing the longest edge leads to tiles with the
largest inscribed sphere, and thus the largest difference stencil fitting inside the tile.

Let $\{P_i\}$ be the parallelepipeds of the tiling, and let $s_i$ be the address offset of
$P_i$ relative to $P_1$ (corresponding to the same RHS array). We assign one parallelepiped to
each RHS array and choose starting addresses of the arrays, $\mathrm{addr}_i$, in such a way
that images of tiles $P_i$ in the cache do not overlap:
$\mathrm{addr}_i = \mathrm{addr}_1 + m_i S + s_i$, where $m_1 = s_1 = 0$, and
$m_i = m_{i-1} + \lceil (|V| - s_i + s_{i-1})/S \rceil$, $i = 2, \ldots, p$. Sweeping through
the pencil by units of tile $P_1$ in the direction of $v$ we can compute $Ku$ without any cache
conflicts, except on the boundary of the pencils. The number of replacement loads of this
algorithm can be estimated similarly to the number of replacement loads of a single-array
algorithm, taking into account that for calculation of a value of $q$ at any point values of
all $p$ RHS arrays in the neighbor points may have to be in cache, thus reducing the effective
cache size to $\lfloor S/p \rfloor$:



P <p\G\ +p <p\G\ 1 + ec^' 



(14) 



where d' d is a constant which depends only on d, and e is the eccentricity of L. 



by a factor of $p$. Non-pointwise computations may be performed if the operator $K$ is
separable, in the sense that it can be written as
$K(u_1, u_2, \ldots, u_p) = K_1(u_1, K_2(u_2, \ldots K_p(u_p) \ldots))$. In the case of
separability of $K$ the stencil operation can be split into a succession of independent
operations, each involving an intermediate value of $q$ and one RHS array. This would not
require loading all $p$ RHS arrays in cache at each point. Instead, it would suffice to write
intermediate values of $q$ into main memory, and then load them back into cache for completion
of the computations. This results in a larger effective cache size, but more data to be
loaded, so splitting the operation need not improve the total number of loads.

Figure 3: Tiling of a fundamental parallelepiped of a reduced basis of the lattice $L$. We
assume that $|h_+ - h_-| < a|v/p|$ ($a$ is the cache associativity). The tiling effectively
reduces the size of the parallelepiped by a factor of at most $2p$ (since
$x/p \ge \lfloor x/p \rfloor \ge x/(2p)$), and increases the cost of a replacement in the cache
per point of the boundary of the pencil by at most a factor of $p$, since elements of all $p$
RHS arrays will be replaced at the same time.

6 Unfavorable array sizes 

We have implemented our cache fitting algorithm and compared its measured number of cache
misses with that of the compiler-optimized code for the corresponding naturally ordered loop
nest on a MIPS R10000 processor (SGI Origin 2000). For comparison we chose a second order
difference operator (the common 13-point star stencil) and a test set including
three-dimensional grids of sizes $40 \le n_1 \le 100$, $n_2 = 91$, and $n_3 = 100$ (the value
of the second dimension was chosen to show a typical picture; that of the third dimension is
irrelevant). A plot of measured cache misses for both codes is shown in Figure 4. The program
was compiled with options "-O3 -LNO:prefetch=0", using the MIPSpro f77 compiler, version
7.3.1.1m. The prefetch flag disables the prefetching compiler optimization. Without this
option the number of cache misses increases significantly, because the compiler does
aggressive prefetching to try to reduce execution time.

The upper bounds for the cache misses from the previous sections suggest that the number of
replacement cache misses will increase in the cases where the interference lattice has a very
short vector. Very short means that the length is smaller than the diameter of the operator
divided by the cache associativity. In this case the self interference would increase
significantly. This result suggests how to pad arrays to improve cache performance: the
padding should be organized in such a way that the shortest vector in the lattice is not too
short, though short enough to minimize the number of pencils (large index of scanning face
$F$). The sweeping is organized such that pencils are as wide as possible (i.e. the smallest
total number of pencils), while avoiding, in the case of multiple RHS arrays, tiles that are
thinner than the diameter of the stencil operator divided by the cache associativity.

Figure 4: Plot of measured cache misses for $40 \le n_1 \le 100$, $n_2 = 91$ for the 13-point
star stencil. The top line corresponds to the naturally ordered nest, optimized by the SGI
Fortran compiler. The bottom line corresponds to our cache fitting algorithm. A typical ratio
between the two is 3.5. The large fluctuations correspond to grids with short lattice vectors
($n_1 = 45$ and $n_1 = 90$ yield shortest vectors (1,0,1) and (2,0,1), respectively). The
fluctuations of cache misses of the cache fitting algorithm for such grids can be so big that
their cache misses become more numerous than for the compiler-optimized nest.

To demonstrate these unfavorable grids we again choose the second order stencil and
force computations in the nest to follow the natural order¶. Figure 5a shows the correlation
between spikes in the number of cache misses and the presence of a very short vector in the
lattice. We call these lattices unfavorable for cache utilization. Arrays having such lattices
should be avoided on the target machine. When the shortest vector of the interference lattice
is shorter than the diameter of the operator, the number of cache misses sharply increases.
The application developer should avoid such unfavorable array sizes, and compilers should
avoid these sizes using appropriate padding of array dimensions. Note that similar unfavorable
cache effects have been mentioned in [11].
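Whether a given three-dimensional array size is unfavorable in this sense can be tested by a direct search for short interference lattice vectors. The sketch below is a brute-force illustration, not the procedure used for the measurements: the function name, the search bound of 8 per component, and the use of the $L_1$ norm are assumptions (the $L_1$ norm matches the criterion quoted for Figure 5).

```python
from itertools import product


def short_lattice_vectors(n1, n2, S, bound=8):
    """Shortest nonzero interference lattice vectors (L1 norm) within a search box.

    Returns (norm, vectors); only one vector of each +/- pair is reported.
    """
    best, found = None, []
    for v in product(range(-bound, bound + 1), repeat=3):
        if v <= (0, 0, 0):
            continue                     # skip zero and the negated half of each pair
        i1, i2, i3 = v
        if (i1 + n1 * i2 + n1 * n2 * i3) % S:
            continue                     # not a solution of Equation (8)
        norm = abs(i1) + abs(i2) + abs(i3)
        if best is None or norm < best:
            best, found = norm, [v]
        elif norm == best:
            found.append(v)
    return best, found


S = 4096                                 # cache size in words quoted for the R10000
for n1 in (45, 90):
    print(n1, short_lattice_vectors(n1, 91, S))
```

For $n_1 = 45$ and $n_1 = 90$ with $n_2 = 91$ the search returns the vectors (1,0,1) and (2,0,1) quoted in the caption of Figure 4. A padding heuristic along these lines would increase $n_1$ until the returned norm is no longer small relative to the stencil diameter.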



¶This forcing is accomplished by introducing a dependence through a Fortran subroutine that
performs a circular shift of its arguments.

Figure 5: Plot A shows measured fluctuations of cache misses (above 15% of the upper
bound). Plot B shows the interference lattices with short (less than 8 in the $L_1$ norm)
vectors. Array sizes are $40 \le n_1, n_2 \le 100$. The plots can be fitted well by hyperbolae
defined by $n_1 n_2 = mS/2$, $m = 1, 2, 3, 4$, meaning that arrays with unfavorable size are
those whose z-slices are (close to) multiples of half the cache size. The horizontal line in
Plot A shows the position of the graph from Figure 4.



7 Conclusions and future work 

We have demonstrated tight lower and upper bounds for cache misses for calculations of an 
explicit operator K on a structured grid. Our lower bound is valid in the general case of 
fully associative caches, and is based on a discrete isoperimetric theorem. Our upper bound 
is based on a cache fitting algorithm which uses the fundamental parallelepiped of a special 
basis of the interference lattice to fit the data in the cache. The upper bound assumes that 
the shortest vector in the interference lattice is not too short. We have shown that there are 
grids whose interference lattices have this property. We have also shown that the presence 
of a very short vector in the lattice correlates with fluctuations of actual cache misses for 
calculation of a second order explicit operator on three-dimensional grids. The fluctuations 
occur on grids with unfavorable sizes, i.e. on those whose product of the first two dimensions 
is (close to) a multiple of half the cache size. 

Our results can be extended straightforwardly to implicit stencil computations (i.e. those
of the form $q \leftarrow K(q)$) when the problem has a one-dimensional data dependence. Such
a data dependence exists if computations of $q$ at grid points can take place in an arbitrary
order, except that there is a single index $i$ for which
$q(x_1, \ldots, i, \ldots, x_d)$ must be evaluated before
$q(x_1, \ldots, i + a, \ldots, x_d)$ can be calculated (the constant $a$ is either +1 or -1).
Clearly, the lower bound is not affected by the implicitness of $K$. The previously derived
upper bound can still be achieved by prescribing the proper visit order of points within each
parallelepiped, of the scanning face direction within each pencil (positive or negative sweep
direction), and of the visit order of subsequent pencils. This is always possible for a
one-dimensional data dependency.

Our results can also be extended to arrays that store more than one word per grid point
(tensor arrays). The lower bound of Section 5 for operations with multiple right hand sides
immediately applies to tensor arrays. The upper bound of that section also applies, provided
the tensor components can be stored as independent subarrays.

In a future study we plan to extend the results of this paper to more general implicit 
operators, to operators on unstructured grids, and to tensor arrays with restricted storage 
models. We intend to study more closely the dependence of cache misses on the size of the 
operator's stencil. We also plan to enhance the presented results by taking into account 
a secondary cache and TLB, and to formulate bounds for cache misses more directly than 
through the determination of cache loads. 



Appendix A: The simplex and the octahedron 

In this section we list some basic facts on the number of integer points in the octahedron
and simplex. The standard octahedron is defined as:

$$O(d,t) = \left\{x \in \mathbb{Z}^d \;\middle|\; \sum_{i=1}^{d} |x_i| \le t\right\} \qquad (15)$$

and the standard simplex as:

$$S(d,t) = \left\{x \in \mathbb{Z}^d \;\middle|\; 0 \le x_1, \ldots, x_d,\ \sum_{i=1}^{d} |x_i| \le t\right\}. \qquad (16)$$

If we consider sections of the octahedron by planes $x_1 = k$, $k = -t, \ldots, t$, then for the
number of integer points in the octahedron we get the following recurrence relation:

$$ |O(d,t)| = |O(d-1,t)| + 2 \sum_{k=0}^{t-1} |O(d-1,k)|. \qquad (17) $$

This relation can be used to prove that

$$ |O(d,t)| = \sum_{k=0}^{d} 2^k \binom{d}{k} \binom{t}{k} \qquad (18) $$

and that

$$ |\delta O(d,t-1)| = |O(d,t) - O(d,t-1)| = \sum_{k=1}^{d} 2^k \binom{d}{k} \binom{t-1}{k-1}. \qquad (19) $$
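The counting formulas for the octahedron can be checked by brute force on small cases. The following is a minimal sketch, not part of the original derivation; the function names are hypothetical, and the enumeration is only practical for small $d$ and $t$.

```python
from itertools import product
from math import comb

def octahedron_count(d, t):
    """Brute-force |O(d,t)|: integer points with |x_1| + ... + |x_d| <= t."""
    return sum(1 for x in product(range(-t, t + 1), repeat=d)
               if sum(abs(c) for c in x) <= t)

def octahedron_closed_form(d, t):
    """Right-hand side of Equation 18."""
    return sum(2**k * comb(d, k) * comb(t, k) for k in range(d + 1))

def recurrence_rhs(d, t):
    """Right-hand side of Equation 17 (requires d >= 1)."""
    return (octahedron_count(d - 1, t)
            + 2 * sum(octahedron_count(d - 1, k) for k in range(t)))

def boundary_closed_form(d, t):
    """Right-hand side of Equation 19, i.e. |O(d,t)| - |O(d,t-1)| for t >= 1."""
    return sum(2**k * comb(d, k) * comb(t - 1, k - 1)
               for k in range(1, d + 1))
```

For example, `octahedron_count(3, 2)` and `octahedron_closed_form(3, 2)` both give 25, and `boundary_closed_form(3, 2)` gives 18, the number of points at distance exactly 2 from the origin.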



Also, the relation

$$ |\delta O(d,t)| = |\delta O(d,t-1)| + |\delta O(d-1,t)| + |\delta O(d-1,t-1)| \qquad (20) $$

shows that

$$ |\delta O(d,t)| \le (2d+1)\,|\delta O(d,t-1)|. \qquad (21) $$
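Relation 20 and the bound 21 can likewise be checked numerically for small parameters. The sketch below assumes the convention of Equation 19, under which the boundary of $O(d,t)$ is the set of integer points at distance exactly $t+1$ from the origin; the helper names are hypothetical.

```python
from itertools import product

def shell_size(d, s):
    """Number of integer points x in Z^d with |x_1| + ... + |x_d| == s."""
    return sum(1 for x in product(range(-s, s + 1), repeat=d)
               if sum(abs(c) for c in x) == s)

def boundary_size(d, t):
    """|dO(d,t)| = |O(d,t+1)| - |O(d,t)|: the shell of radius t + 1."""
    return shell_size(d, t + 1)

def check_boundary_relations(max_d=3, max_t=4):
    """Verify relation (20) and bound (21) for 1 <= d <= max_d, 1 <= t <= max_t."""
    for d in range(1, max_d + 1):
        for t in range(1, max_t + 1):
            lhs = boundary_size(d, t)
            rhs = (boundary_size(d, t - 1) + boundary_size(d - 1, t)
                   + boundary_size(d - 1, t - 1))
            if lhs != rhs:
                return False
            if lhs > (2 * d + 1) * boundary_size(d, t - 1):
                return False
    return True
```

The check is exhaustive only over the small range given by `max_d` and `max_t`; it is a sanity test, not a proof.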
For the number of integer points in the simplex we have the following recurrence relation:

$$ |S(d,t)| = |S(d-1,t)| + |S(d,t-1)|. \qquad (22) $$

This can be used to prove that (cf. [?], Table 169; see also [12], Section 5):

$$ |S(d,t)| = \binom{d+t}{d}. \qquad (23) $$
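A small enumeration confirms the simplex count and its recurrence. This is a minimal sketch with a hypothetical helper name, intended for small $d$ and $t$ only.

```python
from itertools import product
from math import comb

def simplex_count(d, t):
    """Brute-force |S(d,t)|: non-negative integer points with x_1 + ... + x_d <= t."""
    return sum(1 for x in product(range(t + 1), repeat=d) if sum(x) <= t)

def check_simplex_relations(max_d=4, max_t=5):
    """Verify the closed form (23) and the recurrence (22) on a small range."""
    for d in range(1, max_d + 1):
        for t in range(1, max_t + 1):
            if simplex_count(d, t) != comb(d + t, d):
                return False
            if simplex_count(d, t) != simplex_count(d - 1, t) + simplex_count(d, t - 1):
                return False
    return True
```

The recurrence is Pascal's rule for the binomial coefficient in Equation 23, which is why both checks pass together.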



From Equations 18 and 23 it follows that $|O(d,t)| \le 2^d\,|S(d,t)|$. Also, since $\delta O(d,t-1)$
contains at least two nonoverlapping simplices $S(d-1,t)$ and can be covered by $2^d$ such
simplices, we see that

$$ 2\,|S(d-1,t)| \le |\delta O(d,t-1)| \le 2^d\,|S(d-1,t)|, \quad d \ge 2. \qquad (24) $$

Hence, if $|S(d-1,t)|$ equals $S$, we have for $d \ge 2$:

$$ \frac{|\delta O(d,t)|}{|O(d,t)|} \ge \frac{|\delta O(d,t-1)|}{|O(d,t)|} \ge \frac{2\,|S(d-1,t)|}{2^d\,|S(d,t)|} = \frac{2^{-d+1}}{1 + t/d} \ge 2^{-d+1}\, S^{-\frac{1}{d-1}}, \qquad (25) $$

since from Equation 23 it follows that if $|S(d-1,t)|$ does not exceed $S$, then $1 + t/d$ does
not exceed $S^{1/(d-1)}$.
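The chain of inequalities leading to the surface-to-volume bound can be evaluated numerically from the closed forms. The sketch below assumes the closed forms of Equations 18 and 23 and the boundary convention of Equation 19; `chain_terms` is a hypothetical helper that returns the four quantities being compared, in order.

```python
from math import comb

def octahedron_size(d, t):
    """|O(d,t)| from the closed form of Equation 18."""
    return sum(2**k * comb(d, k) * comb(t, k) for k in range(d + 1))

def boundary_size(d, t):
    """|dO(d,t)| = |O(d,t+1)| - |O(d,t)|."""
    return octahedron_size(d, t + 1) - octahedron_size(d, t)

def simplex_size(d, t):
    """|S(d,t)| from the closed form of Equation 23."""
    return comb(d + t, d)

def chain_terms(d, t):
    """The four terms of the inequality chain, left to right (d >= 2, t >= 1)."""
    S = simplex_size(d - 1, t)
    volume = octahedron_size(d, t)
    return (boundary_size(d, t) / volume,
            boundary_size(d, t - 1) / volume,
            2 * S / (2**d * simplex_size(d, t)),
            2.0 ** (1 - d) * S ** (-1.0 / (d - 1)))
```

For $d = 2$, $t = 1$ the four terms are 1.6, 0.8, 1/3 and 0.25, so the chain is strictly decreasing in that case.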

The isoperimetric inequality ([12], Theorem 2) asserts that the size of the boundary of a
subset $R$ of $\mathbb{Z}^d$ is at least as large as the size of the boundary of the standard sphere that contains $|R|$ points.¹
It is easy to see that any standard sphere is sandwiched between two standard octahedrons
whose radii differ by one. Since the octahedron has the largest volume for a given fixed-size
boundary, Inequality 25 is true for any lattice body with a boundary of size $S$.



Appendix B: The existence of grids with favorable lattices

In order to prove that for every cache of size $S = p^n$, where $p$ is a prime number, there are
grids with interference lattices whose shortest vector has a length $l$ greater than $(S/f)^{1/d}$,
with $f$ independent of $S$, we show:

a. For every dimensionality $d$ there exists a lattice $L$ of the same dimension whose basis
has the form given in Equation 9, and whose shortest vector is sufficiently
long, and

b. a grid can be constructed that has $L$ as its interference lattice.

¹ The standard sphere, defined in [12], is the integer point set of minimal surface area for any given number
of interior points.

Corollary: Since grids with dimensions $n_i + k_i S$, $i = 1, \ldots, d$, have the same interference
lattice for any non-negative integers $k_i$, any grid can be embedded in a favorable larger grid.
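The invariance stated in the corollary can be checked directly on a small example. The sketch below rests on assumptions that are not restated in this appendix: array elements are addressed in column-major order, so that the address of index vector $x$ is $x_1 + n_1 x_2 + n_1 n_2 x_3 + \cdots$, and the interference lattice consists of the index vectors whose address is a multiple of $S$. The function names are hypothetical.

```python
from itertools import product

def strides_mod(dims, S):
    """Address strides 1, n_1, n_1*n_2, ... reduced modulo the cache size S."""
    strides, s = [], 1
    for n in dims:
        strides.append(s % S)
        s *= n
    return strides

def lattice_points(dims, S, radius):
    """Index vectors in the box [-radius, radius]^d whose address is 0 mod S."""
    st = strides_mod(dims, S)
    return {x for x in product(range(-radius, radius + 1), repeat=len(dims))
            if sum(c * w for c, w in zip(x, st)) % S == 0}

def pad(dims, ks, S):
    """Grid with dimensions n_i + k_i * S."""
    return tuple(n + k * S for n, k in zip(dims, ks))
```

Under these assumptions, padding each dimension by a multiple of $S$ leaves every stride unchanged modulo $S$, so for instance `lattice_points((10, 7, 5), 16, 4)` and `lattice_points(pad((10, 7, 5), (1, 2, 0), 16), 16, 4)` are the same set.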



Proof: 



Step a: Let a lattice $L$ have a basis of the form of Equation 9. Any lattice vector $x$
(the basis vectors of $L$ included) with $L^\infty$ norm at most $l$ must be a solution of the
following system of inequalities:

$$ |x_i| \le l, \quad i = 2, \ldots, d, \qquad |S x_1 + m_2 x_2 + \cdots + m_d x_d| \le l. \qquad (26) $$

Existence of a solution to this system is equivalent to that of the system

$$ |x_i| \le l, \quad i = 2, \ldots, d, \qquad \left\| \frac{m_2}{S} x_2 + \cdots + \frac{m_d}{S} x_d \right\| \le \frac{l}{S}, \qquad (27) $$



where $\|z\|$ is the distance from $z$ to the nearest integer. Theorem VIII, Ch. 1
states that there are real numbers $\mu_2, \ldots, \mu_d$, and a constant $c_d$ depending only on
$d$, such that

$$ \| \mu_2 x_2 + \cdots + \mu_d x_d \| \ge \frac{c_d}{l^{d-1}} \qquad (28) $$

for all nonzero $x$ satisfying $|x_i| \le l$, $i = 2, \ldots, d$.

If we choose the nonzero integers $m_i$ in such a way that

$$ |m_i - \mu_i S| < 1 \quad \text{for } i = 2, \ldots, d, \qquad (29) $$

then the left-hand side of (27) differs from that of (28) by less than $(d-1)\,l/S$, and we get

$$ \left\| \frac{m_2}{S} x_2 + \cdots + \frac{m_d}{S} x_d \right\| > \frac{c_d}{l^{d-1}} - \frac{(d-1)\,l}{S}, $$

which shows that Equation 26 has no integer solutions if $l < (c_d S/d)^{1/d}$. Hence, $f$
can be chosen as $f = d/c_d$, and the lattice with the basis given by Equation
9 has a reduced basis with eccentricity depending only on $d$.
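For small caches, the length of the shortest lattice vector can be found by exhaustive search, which makes the difference between favorable and unfavorable multipliers concrete. The sketch below assumes, following system (26), that lattice vectors have the form $(S x_1 + m_2 x_2 + \cdots + m_d x_d,\, x_2, \ldots, x_d)$; `shortest_linf` is a hypothetical helper, and the search is not the constructive argument used in the proof.

```python
from itertools import product

def shortest_linf(S, m, radius):
    """Smallest L-infinity norm of a nonzero vector
    (S*x1 + m[0]*x2 + ... , x2, ..., xd) with |x_i| <= radius for i >= 2.

    The result is exact whenever the returned value is <= radius.
    """
    best = S  # the vector (S, 0, ..., 0) always belongs to the lattice
    for x in product(range(-radius, radius + 1), repeat=len(m)):
        if not any(x):
            continue
        r = sum(mi * xi for mi, xi in zip(m, x)) % S
        first = min(r, S - r)  # best choice of x1 for the first component
        best = min(best, max(first, max(abs(c) for c in x)))
    return best
```

With $S = 16$ and $d = 2$, the multiplier 4 gives a shortest vector of length 4, which equals $S^{1/d}$, while the multiplier 1 gives a shortest vector of length 1.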



Step b: In order to find a grid whose interference lattice is $L$, we first sort the $m_j$ in order of
increasing $\gcd(m_j, S)$. Since we assume that $S = p^n$, where $p$ is prime, we know that
$\gcd(m_i, S)$ divides $\gcd(m_{i+1}, S)$, and the appropriate grid dimensions $n_i$ can be found
directly by solving the congruences $(n_i m_i - m_{i+1}) \bmod S = 0$.
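Solving the congruences of Step b is a standard linear-congruence computation. The sketch below is one way to do it, assuming positive multipliers that are already sorted by increasing $\gcd(m_j, S)$; the function names are hypothetical, and `pow` with a negative exponent requires Python 3.8 or later.

```python
from math import gcd

def solve_congruence(a, b, S):
    """Smallest non-negative n with n*a = b (mod S), or None if none exists."""
    g = gcd(a, S)
    if b % g != 0:
        return None  # solvable only if gcd(a, S) divides b
    mod = S // g
    if mod == 1:
        return 0
    # a/g is invertible modulo S/g, so the solution is unique modulo S/g.
    return (b // g) * pow(a // g, -1, mod) % mod

def dimensions_from_multipliers(m, S):
    """Solve n*m[j] = m[j+1] (mod S) for each consecutive pair in m."""
    return [solve_congruence(a, b, S) for a, b in zip(m, m[1:])]
```

Each solution is determined only modulo $S/\gcd(m_i, S)$, so multiples of that modulus can be added to reach a grid dimension of the desired size. If the multipliers are not sorted, a congruence may be unsolvable: `solve_congruence(4, 2, 16)` returns `None`.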